Abstract

The success of deep learning often derives from well-chosen operational
building blocks. In this work, we revise the temporal convolution
operation in CNNs to better adapt it to text processing. Instead of
concatenating word representations, we appeal to tensor algebra and use
low-rank n-gram tensors to directly exploit interactions between words
already at the convolution stage. Moreover, we extend the n-gram
convolution to non-consecutive words to recognize patterns with
intervening words. Through a combination of low-rank tensors and pattern
weighting, we can efficiently evaluate the resulting convolution
operation via dynamic programming. We test the resulting architecture on
standard sentiment classification and news categorization tasks. Our
model achieves state-of-the-art performance both in terms of accuracy
and training speed. For instance, we obtain 51.2% accuracy on the
fine-grained sentiment classification task.¹

1 Introduction 

Deep learning methods, and convolutional neural networks (CNNs) among
them, have become the de facto top-performing techniques across a range
of NLP tasks such as sentiment classification, question answering, and
semantic parsing. As methods, they require only limited domain knowledge
to reach respectable performance with increasing data and computation,
yet permit easy architectural and operational variations so as to
fine-tune them to specific applications and reach top performance.
Indeed, their success is often contingent on specific architectural and
operational choices.

¹Our code and data are available at
https://github.com/taolei87/text_convnet


CNNs for text applications make use of temporal convolution operators or
filters. Similar to image processing, they are applied at multiple
resolutions, interspersed with non-linearities and pooling. The
convolution operation itself is a linear mapping over "n-gram vectors"
obtained by concatenating consecutive word (or character)
representations. We argue that this basic building block can be improved
in two important respects. First, the power of n-grams derives precisely
from multi-way interactions, and these are clearly missed (initially)
with linear operations on stacked n-gram vectors. Non-linear
interactions within a local context have been shown to improve empirical
performance in various tasks (Mitchell and Lapata, 2008; Kartsaklis et
al., 2012; Socher et al., 2013). Second, many useful patterns are
expressed as non-consecutive phrases, such as semantically close
multi-word expressions (e.g., "not that good", "not nearly as good"). In
typical CNNs, such expressions would have to come together and emerge as
useful patterns after several layers of processing.

We propose to use a feature mapping operation based on tensor products
instead of linear operations on stacked vectors. This enables us to
directly tap into non-linear interactions between adjacent word feature
vectors (Socher et al., 2013; Lei et al., 2014). To offset the
accompanying parametric explosion, we maintain a low-rank representation
of the tensor parameters. Moreover, we show that this feature mapping
can be applied to all possible non-consecutive n-grams in the sequence
with an exponentially decaying weight depending on the length of the
span. Owing to the low-rank representation of the tensor, this operation
can be performed efficiently in linear time with respect to the sequence
length via dynamic programming. Similar to traditional convolution
operations, our non-linear feature mapping can be applied successively
at multiple levels.



We evaluate the proposed architecture in the context of sentence
sentiment classification and news categorization. On the Stanford
Sentiment Treebank dataset, our model obtains state-of-the-art
performance among a variety of neural networks in terms of both accuracy
and training cost. Our model achieves 51.2% accuracy on fine-grained
classification and 88.6% on binary classification, outperforming the
best published numbers obtained by a deep recursive model (Tai et al.,
2015) and a convolutional model (Kim, 2014). On the Chinese news
categorization task, our model achieves 80.0% accuracy, while the
closest baseline achieves 79.2%.

2 Related Work 

Deep neural networks have recently brought about significant
advancements in various natural language processing tasks, such as
language modeling (Bengio et al., 2003; Mikolov et al., 2010), sentiment
analysis (Socher et al., 2013; Iyyer et al., 2015; Le and Zuidema,
2015), syntactic parsing (Collobert and Weston, 2008; Socher et al.,
2011a; Chen and Manning, 2014) and machine translation (Bahdanau et al.,
2014; Devlin et al., 2014; Sutskever et al., 2014). Models applied in
these tasks exhibit significant architectural differences, ranging from
recurrent neural networks (Mikolov et al., 2010; Kalchbrenner and
Blunsom, 2013) to recursive models (Pollack, 1990; Küchler and Goller,
1996), and including convolutional neural nets (Collobert and Weston,
2008; Collobert et al., 2011; Yih et al., 2014; Shen et al., 2014;
Kalchbrenner et al., 2014; Zhang and LeCun, 2015).

Our model most closely relates to the latter. Since these models were
originally developed for computer vision (LeCun et al., 1998), their
application to NLP tasks introduced a number of modifications. For
instance, Collobert et al. (2011) use the max-over-time pooling
operation to aggregate the features over the input sequence. This
variant has been successfully applied to semantic parsing (Yih et al.,
2014) and information retrieval (Shen et al., 2014; Gao et al., 2014).
Kalchbrenner et al. (2014) instead propose a (dynamic) k-max pooling
operation for modeling sentences. In addition, Kim (2014) combines CNNs
of different filter widths and either static or fine-tuned word vectors.
In contrast to the traditional CNN models, our method considers
non-consecutive n-grams, thereby expanding the representation capacity
of the model. Moreover, our model captures non-linear interactions
within n-gram snippets through the use of tensors, moving beyond the
direct linear projection operator used in standard CNNs. As our
experiments demonstrate, these advancements result in improved
performance.

3 Background 

Let $x \in \mathbb{R}^{L \times d}$ be the input sequence such as a
document or sentence. Here $L$ is the length of the sequence and each
$x_i \in \mathbb{R}^d$ is a vector representing the $i$-th word. The
(consecutive) n-gram vector ending at position $j$ is obtained by simply
concatenating the corresponding word vectors

    $v_j = [x_{j-n+1}; x_{j-n+2}; \cdots; x_j]$

Out-of-index words are simply set to all zeros.

The traditional convolution operator is parameterized by a filter matrix
$m \in \mathbb{R}^{nd \times h}$, which can be thought of as $n$ smaller
filter matrices applied to each $x_i$ in vector $v_j$. The operator maps
each n-gram vector $v_j$ in the input sequence to
$m^\top v_j \in \mathbb{R}^h$, so that the input sequence $x$ is
transformed into a sequence of feature representations,

    $[m^\top v_1, \cdots, m^\top v_L] \in \mathbb{R}^{L \times h}$

The resulting feature values are often passed through non-linearities
such as the hyperbolic tangent (element-wise) as well as aggregated or
reduced by "sum-over" or "max-pooling" operations for later (similar)
stages of processing.

The overall architecture can be easily modified by replacing the basic
n-gram vectors and the convolution operation with other feature
mappings. Indeed, we appeal to tensor algebra to introduce a non-linear
feature mapping that operates on non-consecutive n-grams.
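
To make the baseline operator of this section concrete, the following is a minimal sketch in plain Python (lists instead of a tensor library). The function names `ngram_vector` and `linear_conv`, the 0-based positions, and the toy filter are illustrative assumptions, not part of the released code.

```python
def ngram_vector(x, j, n):
    """Concatenate the n word vectors ending at position j (0-based).

    Out-of-index words are replaced by all-zero vectors.
    """
    d = len(x[0])
    v = []
    for t in range(j - n + 1, j + 1):
        v.extend(x[t] if t >= 0 else [0.0] * d)
    return v


def linear_conv(x, m, n):
    """Map every n-gram vector v_j to m^T v_j, with m of shape (n*d) x h."""
    h = len(m[0])
    out = []
    for j in range(len(x)):
        v = ngram_vector(x, j, n)
        out.append([sum(v[r] * m[r][c] for r in range(len(v))) for c in range(h)])
    return out


# Toy input: L = 3 words of dimension d = 2, one bigram filter (n = 2, h = 1).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Hypothetical filter: first coordinate of the previous word
# plus second coordinate of the current word.
m = [[1.0], [0.0], [0.0], [1.0]]
features = linear_conv(x, m, n=2)
```

Each output coordinate is a weighted sum of individual word coordinates, so no product between coordinates of different words can appear at this stage.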

4 Model 

N-gram tensor Typical n-gram feature mappings, where concatenated word
vectors are mapped linearly to feature coordinates, may be insufficient
to directly capture relevant information in the n-gram. As a remedy, we
replace concatenation with a tensor product. Consider a 3-gram
$(x_1, x_2, x_3)$ and the corresponding tensor product
$x_1 \otimes x_2 \otimes x_3$. The tensor product is a 3-way array of
coordinate interactions such that each $ijk$ entry of the tensor is
given by the product of the corresponding coordinates of the word
vectors

    $(x_1 \otimes x_2 \otimes x_3)_{ijk} = x_{1i} \cdot x_{2j} \cdot x_{3k}$

Here $\otimes$ denotes the tensor product operator. The tensor product
of a 2-gram analogously gives a two-way array or matrix
$x_1 \otimes x_2 \in \mathbb{R}^{d \times d}$. The n-gram tensor can be
seen as a direct generalization of the typical concatenated vector.²
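
As a small illustration of the 3-gram tensor, the sketch below builds it for toy vectors with d = 2 and then repeats the construction after appending a constant "bias" coordinate of 1 to each word, in the spirit of the accompanying footnote. The helper name `tensor3` and the toy values are assumptions made for illustration.

```python
def tensor3(x1, x2, x3):
    """3-way tensor product: entry [i][j][k] equals x1[i] * x2[j] * x3[k]."""
    return [[[a * b * c for c in x3] for b in x2] for a in x1]


x1, x2, x3 = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
t = tensor3(x1, x2, x3)
entry = t[1][0][1]  # x1[1] * x2[0] * x3[1] = 2 * 3 * 6

# Append a bias coordinate equal to 1 to every word vector.
b1, b2, b3 = x1 + [1.0], x2 + [1.0], x3 + [1.0]
tb = tensor3(b1, b2, b3)
last = len(b1) - 1

# The original coordinates reappear where the other two words contribute
# only their bias coordinate, so the concatenated vector is a subset of
# the tensor entries.
recovered_x1 = [tb[i][last][last] for i in range(len(x1))]
recovered_x2 = [tb[last][j][last] for j in range(len(x2))]
recovered_x3 = [tb[last][last][k] for k in range(len(x3))]
```

The remaining entries of `tb` hold the pairwise and three-way products that a linear map over the concatenated vector cannot express.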

Tensor-based feature mapping Since each n-gram in the sequence is now
expanded into a high-dimensional tensor using tensor products, the set
of filters is analogously maintained as high-order tensors. In other
words, our filters are linear mappings over the higher-dimensional
interaction terms rather than the original word coordinates.

Consider again mapping a 3-gram $(x_1, x_2, x_3)$ into a feature
representation. Each filter is a 3-way tensor with dimensions
$d \times d \times d$. The set of $h$ filters, denoted as $T$, is a
4-way tensor of dimension $d \times d \times d \times h$, where each
$d^3$ slice of $T$ represents a single filter and $h$ is the number of
such filters, i.e., the feature dimension. The resulting
$h$-dimensional feature representation $z \in \mathbb{R}^h$ for the
3-gram $(x_1, x_2, x_3)$ is obtained by multiplying the filter $T$ and
the 3-gram tensor as follows. The $l$-th coordinate of $z$ is given by

    $z_l = \sum_{ijk} T_{ijkl} \cdot (x_1 \otimes x_2 \otimes x_3)_{ijk}$
    $\;\;\;\; = \sum_{ijk} T_{ijkl} \cdot x_{1i} \cdot x_{2j} \cdot x_{3k}$    (1)

The formula is equivalent to summing over all the third-order
polynomial interaction terms, where tensor $T$ stores the coefficients.

²To see this, consider word vectors with a "bias" term
$\hat{x}_i = [x_i; 1]$. The tensor product of n such vectors includes
the concatenated vector as a subset of tensor entries but, in addition,
contains all interaction terms up to $n$-th order.

Directly maintaining the filters as full tensors leads to parametric
explosion. Indeed, the size of the tensor $T$ (i.e., $h \times d^3$)
would be too large even for typical low-dimensional word vectors where,
e.g., $d = 300$. To this end, we assume a low-rank factorization of the
tensor $T$, represented in the Kruskal form. Specifically, $T$ is
decomposed into a sum of $h$ rank-1 tensors

    $T = \sum_{i=1}^{h} P_i \otimes Q_i \otimes R_i \otimes O_i$

where $P, Q, R \in \mathbb{R}^{h \times d}$ and
$O \in \mathbb{R}^{h \times h}$ are four smaller parameter matrices.
$P_i$ (similarly $Q_i$, $R_i$ and $O_i$) denotes the $i$-th row of the
matrix. Note that, for simplicity, we have assumed that the number of
rank-1 components in the decomposition is equal to the feature dimension
$h$. Plugging the low-rank factorization into Eq.(1), the feature
mapping can be rewritten in a vector form as

    $z = O^\top (P x_1 \odot Q x_2 \odot R x_3)$    (2)

where $\odot$ is the element-wise product such that, e.g.,
$(a \odot b)_k = a_k \times b_k$ for $a, b \in \mathbb{R}^m$. Note that
while $P x_1$ (similarly $Q x_2$ and $R x_3$) is a linear mapping from
each word $x_1$ (similarly $x_2$ and $x_3$) into an $h$-dimensional
feature space, higher-order terms arise from the element-wise products.
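
The equivalence between the full-tensor form in Eq.(1) and the vector form in Eq.(2) can be checked numerically on small random inputs. The sketch below is only a sanity check under assumed toy sizes (d = 3, h = 4); the function names and the random test values are illustrative.

```python
import random


def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def low_rank_feature(P, Q, R, O, x1, x2, x3):
    """Eq. (2): z = O^T (P x1 * Q x2 * R x3), with element-wise products."""
    prod = [a * b * c for a, b, c in zip(matvec(P, x1), matvec(Q, x2), matvec(R, x3))]
    return [sum(O[i][l] * prod[i] for i in range(len(prod))) for l in range(len(O[0]))]


def full_tensor_feature(P, Q, R, O, x1, x2, x3):
    """Eq. (1), with T_ijkl = sum_r P[r][i] * Q[r][j] * R[r][k] * O[r][l]."""
    d, h = len(x1), len(O[0])
    z = []
    for l in range(h):
        total = 0.0
        for i in range(d):
            for j in range(d):
                for k in range(d):
                    t_ijkl = sum(P[r][i] * Q[r][j] * R[r][k] * O[r][l]
                                 for r in range(len(P)))
                    total += t_ijkl * x1[i] * x2[j] * x3[k]
        z.append(total)
    return z


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, h = 3, 4
P, Q, R = (random_matrix(h, d, rng) for _ in range(3))
O = random_matrix(h, h, rng)
x1, x2, x3 = ([rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(3))

z_low = low_rank_feature(P, Q, R, O, x1, x2, x3)
z_full = full_tensor_feature(P, Q, R, O, x1, x2, x3)
max_gap = max(abs(a - b) for a, b in zip(z_low, z_full))
```

The low-rank form needs three matrix-vector products and one projection, whereas the explicit form touches all $h \times d^3$ tensor entries.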

Non-consecutive n-gram features Traditional convolution uses
consecutive n-grams in the feature map. Non-consecutive n-grams may
nevertheless be helpful, since phrases such as "not good", "not so good"
and "not nearly as good" express similar sentiments but involve variable
spacings between the key words. Variable spacings are not effectively
captured by fixed n-grams.

We apply the feature mapping in a weighted manner to all n-grams,
thereby gaining access to patterns such as "not ... good". Let
$z[i,j,k] \in \mathbb{R}^h$ denote the feature representation
corresponding to a 3-gram $(x_i, x_j, x_k)$ of words in positions $i$,
$j$, and $k$ along the sequence. This vector is calculated analogously
to Eq.(2),

    $z[i,j,k] = O^\top (P x_i \odot Q x_j \odot R x_k)$

We will aggregate these vectors into an $h$-dimensional feature
representation at each position in the sequence. The idea is similar to
neural bag-of-words models, where the feature representation for a
document or sentence is obtained by averaging (or summing) all the word
vectors. In our case, we define the aggregate representation $z_3[k]$ in
position $k$ as the weighted sum of all 3-gram feature representations
ending at position $k$, i.e.,

    $z_3[k] = \sum_{i<j<k} z[i,j,k] \cdot \lambda^{(k-j-1)+(j-i-1)}$
    $\;\;\;\;\;\;\; = \sum_{i<j<k} z[i,j,k] \cdot \lambda^{k-i-2}$    (3)

where $\lambda \in [0,1)$ is a decay factor that down-weights 3-grams
with longer spans (i.e., 3-grams that skip more in-between words). As
$\lambda \to 0$ all non-consecutive 3-grams are omitted,
$z_3[k] = z[k-2, k-1, k]$, and the model acts like a traditional model
with only consecutive n-grams. When $\lambda > 0$, however, $z_3[k]$ is
a weighted sum of many 3-grams with variable spans.
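
The decay weights in Eq.(3) can be enumerated directly for a short sequence. The sketch below uses 0-based positions and a hypothetical helper `trigram_weights`; it only lists the weights and does not compute any features.

```python
from itertools import combinations


def trigram_weights(k, lam):
    """Weights lam ** (k - i - 2) for all 3-grams (i, j, k) with i < j < k."""
    return {(i, j, k): lam ** (k - i - 2) for i, j in combinations(range(k), 2)}


# 3-grams ending at position k = 3 of a four-word sequence (positions 0..3).
w = trigram_weights(k=3, lam=0.5)
# (1, 2, 3) is consecutive and keeps weight 1;
# (0, 1, 3) and (0, 2, 3) each skip one word and are weighted by 0.5.

# With lam = 0 only the consecutive 3-gram survives
# (Python evaluates 0.0 ** 0 as 1.0).
w0 = trigram_weights(k=3, lam=0.0)
```

Note that the weight depends only on the total span k - i, not on where the middle word falls inside the span.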

Aggregating features via dynamic programming Directly calculating
$z_3[\cdot]$ according to Eq.(3) by enumerating all 3-grams would
require $O(L^3)$ feature-mapping operations. We can, however, evaluate
the features more efficiently by relying on the associative and
distributive properties of the feature operation in Eq.(2).

Let $f_3[k]$ be a dynamic programming table representing the sum of
3-gram feature representations before multiplying with matrix $O$. That
is, $z_3[k] = O^\top f_3[k]$ or, equivalently,

    $f_3[k] = \sum_{i<j<k} \lambda^{k-i-2} \cdot (P x_i \odot Q x_j \odot R x_k)$

We can analogously define $f_1[i]$ and $f_2[j]$ for 1-grams and 2-grams,

    $f_1[i] = P x_i$
    $f_2[j] = \sum_{i<j} \lambda^{j-i-1} \cdot (P x_i \odot Q x_j)$

These dynamic programming tables can be calculated recursively according
to the following formulas:

    $f_1[i] = P x_i$
    $s_1[i] = \lambda \cdot s_1[i-1] + f_1[i]$
    $f_2[j] = s_1[j-1] \odot Q x_j$
    $s_2[j] = \lambda \cdot s_2[j-1] + f_2[j]$
    $f_3[k] = s_2[k-1] \odot R x_k$
    $z[k] = O^\top (f_1[k] + f_2[k] + f_3[k])$

where $s_1[\cdot]$ and $s_2[\cdot]$ are two auxiliary tables. The
resulting $z[\cdot]$ is the sum of 1-, 2-, and 3-gram features. We found
that aggregating the 1-, 2- and 3-gram features in this manner works
better than using 3-gram features alone. Overall, the n-gram feature
aggregation can be performed in $O(Ln)$ matrix multiplication/addition
operations, and remains linear in the sequence length.
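
The recursion can be sketched in a few lines and compared against explicit enumeration of all 1-, 2- and 3-grams. This is an illustrative plain-Python version with assumed toy sizes and function names, not the released Theano implementation; the auxiliary tables are assumed to start at zero before the first word.

```python
import random


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def project(O, f):
    """Compute O^T f for a list-of-rows matrix O."""
    return [sum(O[i][l] * f[i] for i in range(len(f))) for l in range(len(O[0]))]


def tensor_conv(x, P, Q, R, O, lam):
    """Single pass over the sequence using the tables f1, s1, f2, s2, f3."""
    h = len(P)
    s1, s2 = [0.0] * h, [0.0] * h  # s1[k-1] and s2[k-1], zero before the first word
    out = []
    for xk in x:
        f1 = matvec(P, xk)
        f2 = [a * b for a, b in zip(s1, matvec(Q, xk))]  # uses s1[k-1]
        f3 = [a * b for a, b in zip(s2, matvec(R, xk))]  # uses s2[k-1]
        s1 = [lam * a + b for a, b in zip(s1, f1)]
        s2 = [lam * a + b for a, b in zip(s2, f2)]
        out.append(project(O, [a + b + c for a, b, c in zip(f1, f2, f3)]))
    return out


def brute_force(x, P, Q, R, O, lam):
    """Explicit enumeration of all weighted 1-, 2- and 3-grams ending at k."""
    Px, Qx, Rx = ([matvec(M, v) for v in x] for M in (P, Q, R))
    out = []
    for k in range(len(x)):
        total = list(Px[k])
        for i in range(k):
            w = lam ** (k - i - 1)
            total = [t + w * a * b for t, a, b in zip(total, Px[i], Qx[k])]
        for j in range(k):
            for i in range(j):
                w = lam ** (k - i - 2)
                total = [t + w * a * b * c
                         for t, a, b, c in zip(total, Px[i], Qx[j], Rx[k])]
        out.append(project(O, total))
    return out


rng = random.Random(1)
L, d, h = 6, 3, 4
rand = lambda r, c: [[rng.uniform(-1.0, 1.0) for _ in range(c)] for _ in range(r)]
P, Q, R, O = rand(h, d), rand(h, d), rand(h, d), rand(h, h)
x = rand(L, d)
z_dp = tensor_conv(x, P, Q, R, O, lam=0.5)
z_bf = brute_force(x, P, Q, R, O, lam=0.5)
```

The single pass performs a constant number of matrix-vector products per position, while the enumeration grows cubically with the sequence length.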


The overall architecture The dynamic programming algorithm described
above maps the original input sequence to a sequence of feature
representations $z = z[1:L] \in \mathbb{R}^{L \times h}$. As in standard
convolutional architectures, the resulting sequence can be used in
multiple ways. One can directly aggregate it and pass the result to a
classifier, or expose it to non-linear element-wise transformations and
use it as an input to another sequence-to-sequence feature mapping.

The simplest strategy (adopted in neural bag-of-words models) would be
to average the feature representations and pass the resulting averaged
vector directly to a softmax output unit

    $\bar{z} = \frac{1}{L} \sum_{i=1}^{L} z[i]$
    $\tilde{y} = \mathrm{softmax}(W^\top \bar{z})$

Our architecture, as illustrated in Figure 1, includes two additional
refinements. First, we add a non-linear activation function after each
feature representation, i.e., $z' = \mathrm{ReLU}(z + b)$, where $b$ is
a bias vector and ReLU is the rectified linear unit function. Second, we
stack multiple tensor-based feature mapping layers. That is, the input
sequence $x$ is first processed into a feature sequence and passed
through the non-linear transformation to obtain $z^{(1)}$. The resulting
feature sequence $z^{(1)}$ is then analogously processed by another
layer, parameterized by a different set of feature-mapping matrices
$P, \cdots, O$, to obtain a higher-level feature sequence $z^{(2)}$, and
so on. The output feature representations of all these layers are
averaged within each layer and concatenated as shown in Figure 1. The
final prediction is therefore obtained on the basis of features across
the levels.
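
The stacking, per-layer averaging, concatenation and softmax steps can be sketched as follows. To keep the example short, each feature-mapping layer is passed in as a callable; the `double` layer used below is a hypothetical stand-in for the tensor-based mapping, and all names are assumptions.

```python
import math


def relu_bias(seq, b):
    """Apply z' = ReLU(z + b) at every position of the sequence."""
    return [[max(0.0, v + bi) for v, bi in zip(z, b)] for z in seq]


def average(seq):
    """Average the feature vectors over all positions."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers, biases, W):
    """Stack layers, average within each layer, concatenate, apply softmax.

    layers: callables mapping a sequence to a sequence of feature vectors.
    W: matrix with one row per concatenated feature and one column per label.
    """
    pooled, seq = [], x
    for layer, b in zip(layers, biases):
        seq = relu_bias(layer(seq), b)
        pooled.extend(average(seq))
    scores = [sum(W[r][c] * pooled[r] for r in range(len(pooled)))
              for c in range(len(W[0]))]
    return softmax(scores), pooled


# Hypothetical stand-in layer: doubles every coordinate.
double = lambda seq: [[2.0 * v for v in z] for z in seq]
x = [[1.0, -1.0], [3.0, 1.0]]
W = [[0.0, 0.0] for _ in range(4)]  # zero-initialized output weights, two labels
probs, pooled = forward(x, [double, double], [[0.0, 0.0], [0.0, 0.0]], W)
```

With zero output weights the prediction is uniform over the labels, which is the starting point of training when the softmax matrix is initialized to zeros.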

5 Learning the Model 

Following standard practices, we train our model by minimizing the
cross-entropy error on a given training set. For a single training
sequence $x$ and the corresponding gold label $y \in [0,1]^m$, the error
is defined as

    $\mathrm{loss}(x, y) = -\sum_{l=1}^{m} y_l \log(\tilde{y}_l)$

where $m$ is the number of possible output labels and $\tilde{y}$ is the
predicted label distribution.
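
As a worked example of the training objective, the sketch below evaluates the cross-entropy error for one toy prediction, written with the conventional leading minus sign so that smaller values are better. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def cross_entropy(y, y_hat, eps=1e-12):
    """Negative sum of y_l * log(y_hat_l) over the m output labels."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y, y_hat))


y = [0.0, 1.0, 0.0]        # gold label as a one-hot vector (m = 3)
y_hat = [0.2, 0.7, 0.1]    # predicted distribution from the softmax unit
loss = cross_entropy(y, y_hat)  # equals -log(0.7) for a one-hot gold label
```

For a one-hot gold label the loss reduces to the negative log-probability assigned to the correct class.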

Figure 1: Illustration of the model architecture. The input is
represented as a matrix where each row is a d-dimensional word vector.
Several feature map layers (as described in Section 4) are stacked,
mapping the input into different levels of feature representations. The
features are averaged within each layer and then concatenated. Finally,
a softmax layer is applied to obtain the prediction output.

The set of model parameters (e.g., $P, \cdots, O$ in each layer) is
updated via stochastic gradient descent using the AdaGrad algorithm
(Duchi et al., 2011).
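
For readers unfamiliar with AdaGrad, the sketch below shows the per-coordinate update on a one-dimensional toy objective. The objective, the learning rate of 0.5, the iteration count and the small constant `eps` are assumptions chosen for illustration; they are not settings from the experiments.

```python
import math


def adagrad_step(params, grads, accum, lr, eps=1e-8):
    """In-place AdaGrad update: scale each step by the root of the
    accumulated squared gradients of that coordinate."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, acc = [0.0], [0.0]
for _ in range(500):
    grad = [2.0 * (w[0] - 3.0)]
    adagrad_step(w, grad, acc, lr=0.5)
```

Because the accumulated squared gradients only grow, the effective step size of each coordinate shrinks over time, which is why only an initial learning rate needs to be set.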


Initialization We initialize matrices $P, Q, R$ from the uniform
distribution $U[-\sqrt{3/d}, \sqrt{3/d}]$ and similarly
$O \sim U[-\sqrt{3/h}, \sqrt{3/h}]$. In this way, each row of the
matrices is a unit vector in expectation, and each rank-1 filter slice
has unit variance as well,

    $\mathbb{E}\left[ \| P_i \otimes Q_i \otimes R_i \otimes O_i \|^2 \right] = 1$

In addition, the parameter matrix $W$ in the softmax output layer is
initialized as zeros, and the bias vectors $b$ for ReLU activation units
are initialized to a small positive constant 0.01.
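
The unit-norm property can be checked empirically. The sketch below assumes that a matrix with c columns is filled with entries drawn uniformly from $[-\sqrt{3/c}, \sqrt{3/c}]$, so that each entry has variance 1/c and each row has expected squared norm 1; the helper name `init_matrix` and the sizes are illustrative.

```python
import math
import random


def init_matrix(rows, cols, rng):
    """Entries uniform in [-sqrt(3/cols), sqrt(3/cols)]; variance 1/cols each."""
    a = math.sqrt(3.0 / cols)
    return [[rng.uniform(-a, a) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, h = 300, 200
P = init_matrix(h, d, rng)   # same recipe for Q and R
O = init_matrix(h, h, rng)

# Average squared row norm; the expectation is 1 for both matrices.
mean_sq_norm_P = sum(sum(v * v for v in row) for row in P) / h
mean_sq_norm_O = sum(sum(v * v for v in row) for row in O) / h
```

Since the squared norm of a rank-1 tensor is the product of the squared norms of its factors, independent rows with unit expected squared norm give a rank-1 slice with unit expected squared norm.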

Regularization We apply two common techniques to avoid overfitting
during training. First, we add L2 regularization to all parameter values
with the same regularization weight. In addition, we apply dropout
(Hinton et al., 2012) to randomly selected units of the output feature
representations at each level.
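
Both regularizers are simple to state in code. The sketch below uses the common "inverted" dropout variant, which rescales surviving units during training; that rescaling convention and the helper names are assumptions and may differ from the original implementation.

```python
import random


def l2_penalty(param_matrices, weight):
    """weight times the sum of squared entries over all parameter matrices."""
    return weight * sum(v * v for M in param_matrices for row in M for v in row)


def dropout(features, p, rng, train=True):
    """Zero each unit with probability p during training and rescale the
    survivors by 1 / (1 - p); return the input unchanged at test time."""
    if not train or p == 0.0:
        return list(features)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in features]


rng = random.Random(0)
penalty = l2_penalty([[[1.0, 2.0]], [[3.0]]], weight=1e-5)  # 1e-5 * (1 + 4 + 9)
features = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(features, p=0.5, rng=rng)
```

The penalty is added to the cross-entropy error, while dropout is applied only to the output feature representations of each layer during training.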

6 Experimental Setup 

Datasets We evaluate our model on a sentence sentiment classification
task and a news categorization task. For sentiment classification, we
use the Stanford Sentiment Treebank benchmark (Socher et al., 2013). The
dataset consists of 11855 parsed English sentences annotated at both the
root (i.e., sentence) level and the phrase level using 5-class
fine-grained labels. We use the standard 8544/1101/2210 split for
training, development and testing respectively. Following previous work,
we also evaluate our model on the binary classification variant of this
benchmark, ignoring all neutral sentences. The binary version has
6920/872/1821 sentences for training, development and testing.

For the news categorization task, we evaluate on the Sogou Chinese news
corpora.³ The dataset contains 10 different news categories in total,
including Finance, Sports, Technology and Automobile. We use 79520
documents for training, 9940 for development and 9940 for testing. To
obtain Chinese word boundaries, we use LTP-Cloud⁴, an open-source
Chinese NLP platform.

³http://www.sogou.com/labs/dl/c.html

⁴http://www.ltp-cloud.com/intro/en/
https://github.com/HIT-SCIR/ltp

Baselines We implement the standard SVM method and the neural
bag-of-words model NBoW as baseline methods in both tasks. To assess the
proposed tensor-based feature map, we also implement a convolutional
neural network model CNN by replacing our filter with the traditional
linear filter. The rest of the framework (such as feature averaging and
concatenation) remains the same.

In addition, we compare our model with a wide range of top-performing
models on the sentence sentiment classification task. Most of these
models fall into either the category of recursive neural networks (RNNs)
or the category of convolutional neural networks (CNNs). The recursive
neural network baselines include the standard RNN (Socher et al.,
2011b), RNTN with a small core tensor in the composition function
(Socher et al., 2013), the deep recursive model DRNN (Irsoy and Cardie,
2014) and the most recent recursive model using long short-term memory
units RLSTM (Tai et al., 2015). These recursive models assume the input
sentences are represented as parse trees. As a benefit, they can readily
utilize annotations at the phrase level. In contrast, convolutional
neural networks are trained at the sequence level, taking the original
sequence and its label as training input. Such convolutional baselines
include the dynamic CNN with k-max pooling DCNN (Kalchbrenner et al.,
2014) and the multi-channel convolutional model CNN-MC by Kim (2014). To
leverage the phrase-level annotations in the Stanford Sentiment
Treebank, all phrases and the corresponding labels are added as separate
instances when training the sequence models. We follow this strategy and
report results with and without phrase annotations.

                     Fine-grained    Binary          Time (in seconds)
Model                Dev    Test     Dev    Test     per epoch   per 10k samples
RNN                  -      43.2     -      82.4     -           -
RNTN                 -      45.7     -      85.4     1657        1939
DRNN                 -      49.8     -      86.8     431         504
RLSTM                -      51.0     -      88.0     140         164

DCNN                 -      48.5     -      86.9     -           -
CNN-MC               -      47.4     -      88.1     2452        156
CNN                  48.8   47.2     85.7   86.2     32          37

PVEC                 -      48.7     -      87.8     -           -
DAN                  -      48.2     -      86.8     73          5
SVM                  40.1   38.3     78.6   81.3     -           -
NBoW                 45.1   44.5     80.7   82.0     1           1

Ours                 49.5   50.6     87.0   87.0     28          33
 + phrase labels     53.4   51.2     88.9   88.6     445         28

Table 1: Comparison between our model and other baseline methods on
Stanford Sentiment Treebank. The top block lists recursive neural
network models, the second block lists convolutional network models and
the third block contains other baseline methods, including the
paragraph-vector model (Le and Mikolov, 2014), the deep averaging
network model (Iyyer et al., 2015) and our implementation of neural
bag-of-words. The training time of baseline methods is taken from Iyyer
et al. (2015) or directly from the authors. For our implementations,
timings were performed on a single core of a 2.6GHz Intel i7 processor.

Word vectors The word vectors are pre-trained on much larger unannotated
corpora to achieve better generalization given the limited amount of
training data (Turian et al., 2010). In particular, for the English
sentiment classification task, we use the publicly available
300-dimensional GloVe word vectors trained on the Common Crawl with 840B
tokens (Pennington et al., 2014). This choice of word vectors follows
most recent work, such as DAN (Iyyer et al., 2015) and RLSTM (Tai et
al., 2015). For Chinese news categorization, there are no widely used
publicly available word vectors. Therefore, we run word2vec (Mikolov et
al., 2013) to train 200-dimensional word vectors on the 1.6 million
Chinese news articles. Both sets of word vectors are normalized to unit
norm (i.e., $\|w\|_2 = 1$) and are fixed in the experiments without
fine-tuning.

Hyperparameter setting We perform an extensive search on the
hyperparameters of our full model, our implementation of the CNN model
(with linear filters), and the SVM baseline. For our model and the CNN
model, the initial learning rate of AdaGrad is fixed to 0.01 for
sentiment classification and 0.1 for news categorization, and the L2
regularization weight is fixed to 1e-5 and 1e-6 respectively, based on
preliminary runs. The rest of the hyperparameters are randomly chosen as
follows: number of feature-mapping layers $\in \{1, 2, 3\}$, n-gram
order $n \in \{2, 3\}$, hidden feature dimension
$h \in \{50, 100, 200\}$, dropout probability
$\in \{0.0, 0.1, 0.3, 0.5\}$, and length decay
$\lambda \in \{0.0, 0.3, 0.5\}$. We run each configuration 3 times to
explore different random initializations. For the SVM baseline, we tune
the L2 regularization weight $C \in \{0.01, 0.1, 1.0, 10.0\}$, the word
cut-off frequency $\in \{1, 2, 3, 5\}$ (i.e., pruning words that appear
fewer times than this threshold) and the n-gram feature order
$n \in \{1, 2, 3\}$.
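
The random search over the listed value sets can be sketched as below. The dictionary keys, the number of sampled configurations (5) and the use of integer seeds for the three repetitions are assumptions made for illustration.

```python
import random

# Value sets as listed in the text; the key names are hypothetical.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3],
    "ngram_order": [2, 3],
    "hidden_dim": [50, 100, 200],
    "dropout": [0.0, 0.1, 0.3, 0.5],
    "decay": [0.0, 0.3, 0.5],
}


def sample_config(rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]

# Each sampled configuration is run 3 times with different random seeds.
runs = [(cfg, seed) for cfg in configs for seed in range(3)]
```

The full grid contains 3 x 2 x 3 x 4 x 3 = 216 combinations, so random sampling covers only part of it unless many configurations are drawn.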

Implementation details The source code is implemented in Python using
the Theano library (Bergstra et al., 2010), a flexible linear algebra
compiler that can optimize user-specified computations (models) with
efficient automatic low-level implementations, including
(back-propagated) gradient calculation.

7 Results 

7.1 Overall Performance

Table 1 presents the performance of our model and other baseline methods
on the Stanford Sentiment Treebank benchmark. Our full model obtains the
highest accuracy on both the development and test sets. Specifically, it
achieves 51.2% and 88.6% test accuracy on the fine-grained and binary
tasks respectively.⁵ As shown in Table 2, our model performance is
relatively stable: it maintains high accuracy, with around 0.5% standard
deviation, under different initializations and dropout rates.

Our full model is also several times faster than other top-performing
models. For example, the multi-channel convolutional model (CNN-MC)
takes over 2400 seconds per training epoch. In contrast, our full model
(with 3 feature layers) takes on average 28 seconds per epoch with only
root labels and on average 445 seconds with all labels.

Our results also show that the CNN model, where our feature map is
replaced with the traditional linear map, performs worse than our full
model. This observation confirms the importance of the proposed
non-linear, tensor-based feature mapping. The CNN model also lags behind
the DCNN and CNN-MC baselines, since the latter two introduce several
advancements over the standard CNN.

⁵Best hyperparameter configuration based on dev accuracy: 3 layers,
3-gram tensors (n = 3), feature dimension h = 200 and length decay
$\lambda = 0.5$.

Table 3 reports the results of SVM, NBoW, CNN and our model on the news
categorization task. Since the dataset is much larger than the sentiment
dataset (80K documents vs. 8.5K sentences), the SVM method is a
competitive baseline. It achieves 78.5% accuracy, compared to 74.4% and
79.2% obtained by the neural bag-of-words model and the CNN model
respectively. In contrast, our model obtains 80.0% accuracy on both the
development and test sets, outperforming the closest baseline by a 0.8%
absolute margin on the test set. The best hyperparameter configuration
in this task uses fewer feature layers and a lower n-gram order
(specifically, 2 layers and n = 2) than in the sentiment classification
task. We hypothesize that the difference is due to the nature of the two
tasks: document classification requires handling fewer compositions or
context interactions than sentiment analysis.

Dataset         Split   Accuracy
Fine-grained    Dev     52.5 (±0.5) %
                Test    51.4 (±0.6) %
Binary          Dev     88.4 (±0.3) %
                Test    88.4 (±0.5) %

Table 2: Analysis of average accuracy and standard deviation of our
model on the sentiment classification task.

Model           Dev Acc.   Test Acc.
SVM (1-gram)    77.5       77.4
SVM (2-gram)    78.2       78.0
SVM (3-gram)    78.2       78.5
NBoW            74.4       74.4
CNN             79.5       79.2
Ours            80.0       80.0

Table 3: Performance of various methods on the Chinese news
categorization task. Our model obtains better results than the SVM, NBoW
and traditional CNN baselines.

7.2 Hyperparameter Analysis 

We next investigate the impact of hyperparameters on our model
performance. We use the models trained on the fine-grained sentiment
classification task with only root labels.

Number of layers We plot the fine-grained sentiment classification
accuracies obtained during the hyperparameter search. Figure 2
illustrates how the number of feature layers impacts the model
performance. As shown in the figure, adding higher-level features
clearly improves the classification accuracy across various
hyperparameter settings and initializations.

Non-consecutive n-gram features We also analyze the effect of modeling
non-consecutive n-grams. Figure 3 splits the model accuracies according
to the choice of span decaying factor $\lambda$. Note that when
$\lambda = 0$, the model applies feature extractions to consecutive
n-grams only. As shown in Figure 3, this setting leads to a consistent
performance drop. This result confirms the importance of handling
non-consecutive n-gram patterns.

Non-linear activation Finally, we verify the effectiveness of the
rectified linear unit activation function (ReLU) by comparing it with no
activation (or identity activation $f(x) = x$). As shown in Figure 4,
our model with ReLU activation generally outperforms its variant without
ReLU. The observation is consistent with previous work on convolutional
neural networks and other neural network models.

Figure 2: Dev accuracy (x-axis) and test accuracy (y-axis) of
independent runs of our model on the fine-grained sentiment
classification task. Deeper architectures achieve better accuracies.

Figure 3: Comparison of our model variations in the sentiment
classification task when considering consecutive n-grams only (decaying
factor $\lambda = 0$) and when considering non-consecutive n-grams
($\lambda > 0$). Runs are grouped by decay = 0.0, 0.3 and 0.5. Modeling
non-consecutive n-gram features leads to better performance.

Figure 4: Applying ReLU activation on top of tensor-based feature
mapping leads to better performance in the sentiment classification
task. Runs with no activation are plotted as blue circles.

Figure 5 panels (predicted sentiment score plotted at each word
position, on a scale from -2 to 2):
(1) "the movie is good": positive prediction
(2) "the movie is bad": negative prediction
(3) "the movie is not good": negative prediction
(4) "the movie is not bad": positive prediction
(5) "the movie is neither good nor bad": negative prediction
(6) "no movement , no yuks , not much of anything": negative prediction
    (ground truth: negative)
(7) "too bad , but thanks to some lovely .. and several fine .. , it 's
    not a total loss": positive prediction (ground truth: positive)

Figure 5: Example sentences and their sentiments predicted by our model
trained with root labels. The predicted sentiment scores at each word
position are plotted. Examples (1)-(5) are synthetic inputs, (6) and (7)
are two real inputs from the test set. Our model successfully identifies
negation, double negation and phrases with different sentiment in one
sentence.

7.3 Example Predictions 

Figure 5 gives examples of input sentences and the corresponding
predictions of our model in fine-grained sentiment classification. To
see how our model captures the sentiment at different local contexts, we
apply the learned softmax activation to the extracted features at each
position without taking the average. That is, for each index $i$, we
obtain the local sentiment
$p = \mathrm{softmax}(W^\top (z^{(1)}[i] \oplus z^{(2)}[i] \oplus z^{(3)}[i]))$,
where $\oplus$ denotes concatenation of the layer features. We plot the
expected sentiment scores $\sum_{s=-2}^{2} s \cdot p(s)$, where a score
of 2 means "very positive", 0 means "neutral" and -2 means "very
negative". As shown in the figure, our model successfully learns
negation and double negation. The model also identifies positive and
negative segments appearing in the sentence.
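
The expected sentiment score plotted at each position is a probability-weighted average of the five class scores. In the sketch below, the ordering of the classes from "very negative" to "very positive" is an assumption about how the softmax outputs are indexed.

```python
def expected_sentiment(p):
    """Expected score sum_s s * p(s) for classes ordered from
    very negative (-2) to very positive (+2)."""
    scores = [-2, -1, 0, 1, 2]
    return sum(s * q for s, q in zip(scores, p))


very_positive = expected_sentiment([0.0, 0.0, 0.0, 0.0, 1.0])
mostly_negative = expected_sentiment([0.5, 0.5, 0.0, 0.0, 0.0])
undecided = expected_sentiment([0.2, 0.2, 0.2, 0.2, 0.2])
```

A distribution concentrated on one class yields that class's score, while an undecided (uniform) distribution yields a score near zero.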

8 Conclusion 

We proposed a feature mapping operator for convolutional neural networks
by modeling n-gram interactions based on tensor products and evaluating
all non-consecutive n-gram vectors. The associated parameters are
maintained as a low-rank tensor, which leads to efficient feature
extraction via dynamic programming. The model achieves top performance
on standard sentiment classification and document categorization tasks.